Method and system for visualizing information extracted from big data

ABSTRACT

The various embodiments herein describe a method for providing information visualization comprising identifying a plurality of events from a big data, calculating a temporal distance between at least two events, calculating a semantic distance between the at least two events, storing the calculated semantic distance and the temporal distance with respect to the at least two events in a data structure, providing the semantic distance and the temporal distances calculated between the at least two events on a visual representation, identifying one or more relevant events with respect to the target event based on the semantic distance and the temporal distance through visualization, selecting a plurality of events that influence the target event and examining a pattern of the plurality of events to realize possible collinearity of the events to further reduce the influencing events in order to facilitate feature selection through visualization.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority of Indian provisional application serial number 3286/CHE/2012 filed on Aug. 10, 2012, and that application is incorporated in its entirety at least by reference.

BACKGROUND

1. Technical Field

The embodiments herein generally relate to data mining and particularly relate to a method for extracting and processing events from a large collection of data. The embodiments herein more particularly relate to a method and system for visualizing information by realizing the influence of one or more events on a target event.

2. Description of the Related Art

An entity is a unit of data which has an independent, self-explanatory meaning, and is also referred to as an object that makes independent sense. A relationship is a property which describes an association between two or more entities. The relationship between two or more entities helps in understanding the characteristics and behavior of the entities. An event is a relationship that occurs between an entity and a time entity; simply put, it is a relationship with respect to time. Big data is a large collection of data that comes from structured, unstructured and semi-structured data sources. In an analytics context, the entities, relationships and events manifest as variables or features.

In big data analytics, the influence of other variables or features on a given variable is often studied to make a prediction of the value or state of the variable. This is a typical feature selection problem that is magnified in the context of big data analytics because of the large number of features/variables available. The existing technology discusses various feature selection techniques specific to different contexts. Feature selection is quite a complex process and therefore considerable effort has gone into addressing specific problems related to feature selection in different domains.

The current feature selection procedures are mostly based on machine learning/statistics/data mining techniques. All these efforts require a good understanding of machine learning, statistics techniques and also of the problem domain. However, the existing techniques do not provide a simple, generic strategy that can be applied to all contexts.

In big data analytics, the problem of predicting the occurrence of events often arises. The occurrence of a given event is greatly influenced by the occurrences of many other events that happen simultaneously. However, not all of the events bear equal influence on the target event. To predict the occurrence or the state of a target event, it is important to identify the events that bear high influence on the target event. This reduces the problem to that of feature selection or dimension reduction. Feature selection typically requires utilization of domain knowledge combined with knowledge of statistics, data mining and machine learning.

Hence, there is a need for a method and system for visualizing the influence of one or more events on a target event interactively. Also, there is a need for a method and system for performing feature selection without the need for deep understanding of statistics, machine learning or the problem domain. Further, there is a need for a method and system for providing effective visualization of the feature selection. Moreover, there is a need for a method and system for enabling feature selection from various context perspectives.

The abovementioned shortcomings, disadvantages and problems are addressed herein, as will be understood by reading and studying the following specification.

SUMMARY

The primary object of the embodiments herein is to provide a method and system for visualizing the influence of one or more events on a target event interactively.

Another object of the embodiments herein is to provide a simple and intuitive method and system for feature selection which does not require an understanding of statistics, machine learning or the problem domain.

Another object of the embodiments herein is to provide a method and system which enables effective feature selection to identify the relevant events that influence the target event.

Another object of the embodiments herein is to provide a method and system for visualizing information which employs an information dimension for representing information based on semantic and temporal relatedness.

Another object of the embodiments herein is to provide a method and system for computing semantic distance and temporal distance between one or more influencing events and a target event.

Another object of the embodiments herein is to provide a method and system for representing information on semantic vs. temporal distance axes.

These and other objects and advantages of the present embodiments will become readily apparent from the following detailed description taken in conjunction with the accompanying drawings.

The various embodiments herein describe a method for providing information visualization comprising identifying a plurality of events from a big data, calculating a temporal distance between at least two events, calculating a semantic distance between the at least two events, storing the calculated semantic distance and the temporal distance with respect to the at least two events in a data structure, providing the semantic distance and the temporal distances calculated between the at least two events on a visual representation, identifying one or more relevant events with respect to the target event based on the semantic distance and the temporal distance through visualization, selecting a plurality of events that influence the target event and examining a pattern of the plurality of events to realize possible collinearity of the events to limit the influencing events considered for feature analysis.

According to an embodiment herein, realizing the collinearity of the events comprises identifying at least two events sharing an exact relationship and reducing the number of variables to ensure effective feature selection.

According to an embodiment herein, an event is defined as a relationship which occurred at an instant of time.

According to an embodiment herein, the method for providing information visualization further comprises selecting a new target event and re-visualizing the influence of the selected features on the new target event.

According to an embodiment herein, the feature selection is defined as selecting a plurality of variables having an influence on the occurrence of the event.

According to an embodiment herein, calculating the temporal distance between at least two events comprises computing a temporal correlation between the at least two events. The temporal correlation is a measure of correlation of the events across time.

According to an embodiment herein, the temporal distance is calculated on the basis of time series data obtained from big data.

According to an embodiment herein, calculating the semantic distance between the two events comprises at least one of measuring a contextual distance between one or more words if the plurality of events are described by the words, measuring a contextual similarity or a semantic similarity as provided by a domain model and a language model and measuring the contextual or the semantic similarity based on analyzing the relationships and the entities.

According to an embodiment herein, the semantic distance is calculated using structured data, unstructured data, or a combination of structured data and unstructured data.

According to an embodiment herein, the method for providing information visualization further comprises storing the semantic distance and the temporal distances between the at least two events in a data structure capable of storing both the semantic distances and the temporal distances separately.

According to an embodiment herein, a value corresponding to the semantic distance and the temporal distance is in a preset numerical range or corresponds to a discrete range of values, with an event being closest to itself both temporally and semantically.

According to an embodiment herein, the representation of the visualization comprises events represented by a predefined shape, negatively co-related events differentiated by the shape, different events or event types represented by different colors, highly influential events arranged around the target event and less influential events arranged away from the targeted event as they approach the origin.

According to an embodiment herein, the method for providing information visualization further comprises setting limits for semantic distances and the temporal distances for the plurality of events influencing the target event and selecting the events falling in the defined limit as the highly influential events.

The various embodiments herein describe a system for providing information visualization comprising an event extractor to extract a plurality of events from a big data, and a semantic distance estimator to calculate a semantic distance between at least two events. The system further comprises a temporal distance estimator to calculate a temporal distance between the at least two events, a data structure to store the calculated temporal distance and the semantic distance with respect to a pair of events, a user interface provided on a user device to display one or more relevant events with respect to the target event based on the semantic distance and the temporal distance and to provide an interactive input to the data structure to select a plurality of events that influence the target event, and a display unit for visualizing the influence of the selected features on the target event.

According to an embodiment herein, the temporal distance estimator calculates the temporal distance between a pair of events by computing a temporal correlation between the pair of events across time.

According to an embodiment herein, the semantic distance estimator calculates the semantic distance between the two events by at least one of measuring a contextual distance between one or more words if the plurality of events are described by the words, measuring a contextual similarity or a semantic similarity as provided by a domain model and a language model and measuring the contextual similarity or the semantic similarity based on analyzing the relationships and the entities.

According to an embodiment herein, the semantic distances and the temporal distances are stored in a data structure which allows separate storage and easy retrieval of each of the semantic distance and the temporal distance with respect to an event pair.

These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.

BRIEF DESCRIPTION OF THE DRAWINGS

The other objects, features and advantages will occur to those skilled in the art from the following description of the preferred embodiment and the accompanying drawings in which:

FIG. 1 is a block diagram illustrating a process for visualizing the influence of one or more events on a target event, according to an embodiment of the present disclosure.

FIG. 2 is a block diagram illustrating a process for computing temporal distance between events, according to an embodiment of the present disclosure.

FIG. 3 is a block diagram illustrating a process for computing semantic distance between events, according to an embodiment of the present disclosure.

FIG. 4 is a flow chart illustrating a method for visualizing the effect of events on a target event, according to an embodiment of the present disclosure.

FIG. 5 is a graph showing an example illustration of the interactive visualization of events, according to an embodiment of the present disclosure.

Although the specific features of the present embodiments are shown in some drawings and not in others, this is done for convenience only, as each feature may be combined with any or all of the other features in accordance with the present embodiments.

DETAILED DESCRIPTION OF THE DRAWINGS

In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which the specific embodiments that may be practiced are shown by way of illustration. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that logical, mechanical and other changes may be made without departing from the scope of the embodiments. The following detailed description is therefore not to be taken in a limiting sense.

The various embodiments herein describe a method for providing information visualization comprising identifying a plurality of events from a big data, calculating a temporal distance between at least two events, calculating a semantic distance between the at least two events, storing the calculated semantic distance and the temporal distance with respect to the at least two events in a data structure, providing the semantic distance and the temporal distances calculated with respect to a target event on a visual representation, identifying one or more relevant events with respect to a target event based on the semantic distance and the temporal distance through visualization, selecting a plurality of events that influence the target event, and examining a pattern of the plurality of events to realize possible collinearity among events to reduce the number of influencing events to assist feature selection. Here the collinearity of the plurality of events is suggested by approximately the same semantic and temporal distances on the visualization scheme. The collinear events as defined herein are likely to have the same semantic and temporal distance from the target event and hence appear very close to each other, or one on top of the other, in the visualization.
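
By way of illustration only, the collinearity check described above may be sketched in Python as follows. The function names, the event names and the tolerance value are hypothetical and are not taken from the embodiments; the sketch merely assumes that each event carries a pair of semantic and temporal values computed with respect to the target event, and flags events whose pairs nearly coincide.

```python
from itertools import combinations


def find_collinear_candidates(scores, tolerance=0.05):
    """Return pairs of events whose (semantic, temporal) values nearly coincide.

    scores maps an event name to its (semantic, temporal) values with respect
    to the target event. The tolerance is an assumed, illustrative threshold.
    """
    pairs = []
    # Sorting makes the order of the reported pairs deterministic.
    for (name_a, (sem_a, tmp_a)), (name_b, (sem_b, tmp_b)) in combinations(
        sorted(scores.items()), 2
    ):
        if abs(sem_a - sem_b) <= tolerance and abs(tmp_a - tmp_b) <= tolerance:
            pairs.append((name_a, name_b))
    return pairs


def drop_redundant(scores, tolerance=0.05):
    """Keep the alphabetically first event of each near-coincident pair."""
    dropped = set()
    for name_a, name_b in find_collinear_candidates(scores, tolerance):
        if name_a not in dropped:
            dropped.add(name_b)
    return {name: value for name, value in scores.items() if name not in dropped}


# Hypothetical events and values, for illustration only.
example_scores = {
    "rain": (0.90, 0.80),
    "umbrella_sales": (0.91, 0.82),
    "holiday": (0.30, 0.40),
}
reduced = drop_redundant(example_scores)
```

In this sketch the two events with nearly identical values are treated as candidates for collinearity and only one of them is retained; whether the retained event is the appropriate representative remains a decision for the user examining the visualization.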

The method for providing information visualization further comprises selecting a new target event and re-visualizing the influence of the plurality of events on the new target event. Here an event is defined as a relationship which occurred at an instant of time and feature selection is defined as selecting a plurality of variables having an influence on the occurrence of the event.

The method of calculating the temporal distance between at least two events comprises computing a temporal correlation between the at least two events. The temporal correlation is a measure of correlation of the events across time. The temporal distance is calculated on the basis of time series data obtained from one of structured data, unstructured data or a combination of structured data and unstructured data.

The method of calculating the semantic distance between the two events comprises at least one of measuring a contextual distance between one or more words if the plurality of events are described by the words, measuring a contextual similarity or a semantic similarity as provided by a domain model and a language model and measuring the contextual or the semantic similarity based on analyzing the relationships and the entities. The semantic distance is calculated on the basis of structured data, unstructured data, or a combination of structured data and unstructured data.

The method for providing information visualization further comprises storing the semantic distance and the temporal distances between the at least two events in a data structure capable of storing both the semantic distances and the temporal distances separately.

The value corresponding to the semantic distance and the temporal distance is in a preset numerical range or corresponds to a discrete range of values, with an event being closest to itself both temporally and semantically.

The representation of the visualization comprises events represented by a predefined shape, negatively co-related events differentiated by the shape, different events represented by different colors, different types of events represented by different colors, highly influential events arranged around the target event, and less influential events arranged away from the targeted event as they approach the origin.

The method for providing information visualization further comprises setting limits for semantic distances and the temporal distances for the plurality of events influencing the target event and selecting the events falling in the pre-set limit as the highly influential events.

The various embodiments herein describe a system for providing information visualization comprising an event extractor to extract a plurality of events from a big data, and a semantic distance estimator to calculate a semantic distance between at least two events. At least one event among the events is a target event. The system further comprises a temporal distance estimator to calculate a temporal distance between the at least two events, a data structure to store the calculated semantic distance and the temporal distance with respect to the at least two events, a user interface provided on a user device to display one or more relevant events with respect to the target event based on the semantic distance and the temporal distance and to provide an interactive input to the data structure to select a plurality of events that influence the target event, and a display unit for visualizing the influence of the selected features on the target event.

The temporal distance estimator calculates the temporal distance between at least two events by computing a temporal correlation between the at least two events across time.

The semantic distance estimator calculates the semantic distance between the two events by at least one of: measuring a contextual distance between one or more words if the plurality of events are described by the words, measuring a contextual similarity or a semantic similarity as provided by a domain model and a language model and measuring the contextual similarity or the semantic similarity based on analyzing the relationships and the entities.

The semantic distances and the temporal distances are stored in a data structure which allows separate storage and easy retrieval of each of the semantic distance and the temporal distance with respect to an event-target event pair.

FIG. 1 is a block diagram illustrating a process for visualizing the influence of one or more events on a target event, according to an embodiment of the present disclosure. The big data 101 represents a large collection of data from which one or more events are selected for analysis. The big data 101 comprises a plurality of events and variables that influence an event. The big data 101 also comprises one or more events, specifically termed target events, on which influence is to be visualized. The event extractor 102 processes the big data 101 and extracts one or more events which are specific for analysis. The extracted events are processed by a semantic distance estimator 103 and a temporal distance estimator 104. The semantic distance estimator 103 calculates a semantic distance between the target event and one or more other events. Similarly, the temporal distance estimator 104 calculates a temporal distance between the target event and at least one other event. The computed values of the semantic distance and the temporal distance with respect to any given pair of events are then stored in a data structure 105. The data structure 105 allows separate storage and easy retrieval of each of the semantic distance and the temporal distance with respect to an event-target pair. A user device communicates with the data structure 105 and retrieves the computed values for display. The user device includes a user interface 106 through which a user is able to submit an interactive input to the data structure 105 for selecting a plurality of events that influence a target event according to the user's preferences. The user interface 106 operates in conjunction with a display/visualization unit 107 for displaying one or more relevant events with respect to the target event based on the semantic distance and the temporal distance. The display unit 107 enables the user to visualize the influence of the selected features on the target event.

FIG. 2 is a block diagram illustrating a process for computing temporal distance between events, according to an embodiment of the present disclosure. The temporal distance, also referred to as temporal relatedness, is measured in terms of temporal correlation between two events, where one event is a target event. The temporal correlation is a measure of correlation of events across time. The temporal correlation assigns a value to the co-occurrence of events over a time lag. The influence of an event on a target event changes over a period of time in a specific manner. The influence of other events on a particular event is measured in terms of temporal correlation. The data collected over a period of time is reflected in the temporal relatedness. Therefore, event-related time series data 201, extracted from the big data (a combination of structured, unstructured and semi-structured data) 101, is used to compute temporal correlations. The event-related time series data 201 is taken as input to a temporal distance estimator. The temporal distance estimator performs the computation of temporal correlation 202 based on the event-related time series data 201. If an event always co-occurs with the target event at all instances of time and the magnitudes of the target event and the other event remain proportional, then the temporal correlation of the two events is said to be high. The temporal correlation can be either positive or negative, where a positive correlation indicates a high probability of two events co-occurring and a negative correlation indicates a high chance of two events not co-occurring. The computed temporal distances between any two events are then stored in the data structure 105. The user queries the data structure 105 for visualizing the influence of the one or more events on the target event.
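
By way of illustration only, one possible computation of the temporal correlation 202 is sketched below in Python. The embodiments do not prescribe a particular correlation measure; the sketch assumes a Pearson correlation evaluated over a small range of time lags, and the function names and the max_lag parameter are hypothetical. The magnitude of the correlation is returned as a value between zero and one, and the sign is returned separately so that negatively co-related events can be differentiated in the visualization.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series shows no co-variation
    return cov / math.sqrt(var_x * var_y)


def temporal_relatedness(event_series, target_series, max_lag=0):
    """Return (strength, sign) of the strongest correlation over the lags.

    Lags 0..max_lag are tried, with the event leading the target by `lag`
    time steps. Strength lies in [0, 1]; sign is +1 or -1.
    """
    best = 0.0
    for lag in range(max_lag + 1):
        if lag == 0:
            xs, ys = event_series, target_series
        else:
            xs, ys = event_series[:-lag], target_series[lag:]
        if len(xs) < 2:
            break  # not enough overlapping samples at this lag
        r = pearson(xs, ys)
        if abs(r) > abs(best):
            best = r
    return abs(best), (1 if best >= 0 else -1)
```

Under these assumptions, an event series that is proportional to the target series yields a strength of approximately one with a positive sign, and a series that moves in the opposite direction yields the same strength with a negative sign.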

FIG. 3 is a block diagram illustrating a process for computing semantic distance between events, according to an embodiment of the present disclosure. The semantic distance, also referred to as semantic relatedness, defines a contextual/semantic relation between any two events. For computation of the semantic distance, information is extracted from the big data (the combination of structured, semi-structured and unstructured data) 101, the events 301, one or more domain models 302 and language models 303. The method of computation adopted by the semantic distance estimator depends on the nature of the data available. First, event data extracted from the big data 101 is received as input. The computation of contextual or semantic similarity 304 is performed by the semantic distance estimator based on the type of event data 301 received. For instance, when the event data is described by words, the semantic distance is computed by measuring the contextual or semantic distance between the words. The semantic distance estimator also measures the contextual similarity on the basis of data provided by ontologies, domain models 302, and language repositories and language models 303, or upon analyzing relationships and entities. The computed semantic distance is then stored in the data structure 105 for display to a user.
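
By way of illustration only, the case in which the event data is described by words may be sketched in Python as follows. The embodiments do not specify a particular similarity measure; the sketch assumes a cosine similarity over word counts as a simple stand-in, and does not model ontologies, domain models or language models. The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the description and split it on whitespace."""
    return text.lower().split()


def word_similarity(description_a, description_b):
    """Cosine similarity of word counts, in the range [0, 1].

    A value near 1 means the two event descriptions use the same words in
    the same proportions; 0 means they share no words.
    """
    counts_a = Counter(tokenize(description_a))
    counts_b = Counter(tokenize(description_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as unrelated
    return dot / (norm_a * norm_b)
```

With this stand-in measure an event description compared with itself scores approximately one, which is consistent with an event being semantically closest to itself. A measure based on word counts ignores synonyms and context, which is where a domain model or a language model would be expected to contribute.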

FIG. 4 is a flow chart illustrating a method for visualizing the effect of events on a target event, according to an embodiment of the present disclosure. The method comprises identifying a plurality of events from the big data (401). The events are extracted from the big data by an event extractor. The semantic distance and the temporal distance are then calculated between every pair of events (402). The semantic distance estimator calculates the semantic distance between every pair of events. Similarly, the temporal distance estimator calculates a temporal distance between every pair of events. The computed semantic distance and temporal distance values are then stored in the data structure. The data structure is then queried by a user, as required, for visualizing the influence of the events on the target event (403). A display unit is then adopted for visually representing the calculated temporal and semantic distances between the at least two events. The visual representation comprises information coordinate axes on which the influencing events are represented by different shapes, colors and sizes, and the target event is placed at a designated coordinate (404). Based on the semantic distance and the temporal distance, the events which are relevant to the target event are automatically identified (405). The most influential events lie nearest to the target event. The user then selects the plurality of events that influence the target event through visualization (406). Further, the user is also allowed to change the target event and visualize results particular to the new target event interactively (407). Steps 404 to 406 are then repeated for the new target event.

FIG. 5 is a graph showing an example illustration of the interactive visualization of events, according to an embodiment of the present disclosure. The visualization of events that influence the target event is represented on an information dimension. The information dimension comprises a new co-ordinate system consisting of semantic and temporal distance axes starting from the origin (0, 0). A rectangular region 501, with the two axes as sides, is the region where the events are visually represented. The rectangular boxes of different patterns inside the rectangular region represent different events or different types of events. The one or more events are represented by small rectangles (or any other shape). The negatively co-related events are differentiated by the shape. The different events or different types of events can also be represented by different colors and shapes. The different types of shapes comprise, but are not limited to, circular, square and rectangular shapes. The different shapes indicate different influences (positive and negative). The rectangular box at the coordinate (1, 1) is the target event 502. The events closer to the target event 502 have more influence on the target event 502. The highly influential events are spread around the target event 502. While FIG. 5 is a simple two-dimensional representation with (0, 0) as the origin and (1, 1) as the extreme co-ordinate, the embodiments herein provide multiple different representations, each for a different purpose. The influence of events that are spread away from the target event 502 decreases as they approach the origin. The user can set the limits of the semantic and temporal distances, and the events that fall within the rectangle defined by these limits are the most influential events. The user is also allowed to change the target event 502, and the graphical representation is recast readily, as the data structure holds all the information required to recast it.
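
By way of illustration only, the selection of the most influential events by user-set limits may be sketched in Python as follows. The sketch assumes that the rectangle defined by the limits extends from the user-set limits up to the target event at (1, 1), and it ranks events by their straight-line distance to the target on the plot. Both assumptions, as well as the function names and the example values, are hypothetical and are not prescribed by the embodiments.

```python
import math

TARGET = (1.0, 1.0)  # position of the target event on the plot of FIG. 5


def within_limits(scores, semantic_limit, temporal_limit):
    """Select events whose values both reach the user-set limits.

    scores maps an event name to its (semantic, temporal) values with respect
    to the target event, each between 0 and 1 with 1 being the closest.
    """
    return {
        name: (semantic, temporal)
        for name, (semantic, temporal) in scores.items()
        if semantic >= semantic_limit and temporal >= temporal_limit
    }


def rank_by_proximity(scores):
    """Order event names from nearest to farthest from the target position."""
    return sorted(scores, key=lambda name: math.dist(scores[name], TARGET))


# Hypothetical events and values, for illustration only.
example_scores = {"a": (0.90, 0.95), "b": (0.20, 0.30), "c": (0.70, 0.85)}
selected = within_limits(example_scores, semantic_limit=0.6, temporal_limit=0.8)
ranking = rank_by_proximity(example_scores)
```

In this sketch, tightening either limit shrinks the rectangle toward the target event and reduces the number of events selected as highly influential.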

According to an embodiment herein, the data structure storing the semantic distance and the temporal distance with respect to an event-target event pair is in a matrix form, in which the lower and upper triangles hold the semantic and the temporal distances respectively. The values corresponding to the semantic and temporal distances in the matrix form vary from zero to one, with one being the closest and representing the semantic/temporal distance of an event to itself.
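
By way of illustration only, the matrix form described above may be sketched in Python as follows. The class name, the method names and the choice of a list of lists are hypothetical; the sketch only assumes a square matrix whose lower triangle holds the semantic values, whose upper triangle holds the temporal values, and whose diagonal is fixed at one.

```python
class DistanceMatrix:
    """Square matrix: lower triangle semantic, upper triangle temporal."""

    def __init__(self, event_names):
        self.index = {name: i for i, name in enumerate(event_names)}
        n = len(self.index)
        # The diagonal is 1.0: an event is closest to itself.
        self.cells = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

    def _ordered(self, event_a, event_b):
        """Return the two matrix indices as (low, high)."""
        i, j = self.index[event_a], self.index[event_b]
        if i == j:
            raise ValueError("the diagonal is fixed at 1.0")
        return (i, j) if i < j else (j, i)

    @staticmethod
    def _check(value):
        if not 0.0 <= value <= 1.0:
            raise ValueError("values must lie between 0 and 1")

    def set_semantic(self, event_a, event_b, value):
        self._check(value)
        low, high = self._ordered(event_a, event_b)
        self.cells[high][low] = value  # row > column: lower triangle

    def set_temporal(self, event_a, event_b, value):
        self._check(value)
        low, high = self._ordered(event_a, event_b)
        self.cells[low][high] = value  # row < column: upper triangle

    def get(self, event_a, event_b):
        """Return (semantic, temporal) for the pair, in either order."""
        if event_a == event_b:
            return (1.0, 1.0)
        low, high = self._ordered(event_a, event_b)
        return (self.cells[high][low], self.cells[low][high])
```

Because every pair of events is held in the same matrix, changing the target event in this sketch amounts to reading the values for a different event against all the others, without recomputing any value.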

The embodiments herein provide a simpler, intuitive, interactive and easy-to-deploy visualization process. Also, the interactive visualization uses the semantic and temporal distances to identify the most relevant variables. This interactive visualization procedure works on simple, heuristic and intuitive principles, making the results easy to understand. The results are displayed with a few clicks, without requiring further user input, and the visualization of influences on an event is made more interactive and intuitive. While the computations of the semantic and temporal distances are complex, the complexities are hidden from the user. The embodiments of the present disclosure provide immense benefit in Retail, Healthcare, Education, Governance, etc.

The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification. 

What is claimed is:
 1. A method for providing information visualization comprising: identifying a plurality of events from a big data; calculating a temporal distance between at least two events; calculating a semantic distance between the at least two events; storing the calculated semantic distance and the temporal distance with respect to the at least two events in a data structure; providing the semantic distance and the temporal distances calculated between the at least two events on a visual representation; identifying one or more relevant events with respect to the target event based on the semantic distance and the temporal distance through visualization; selecting a plurality of events that influence the target event; and examining a pattern of the plurality of events to realize a collinearity of the events to limit the influencing events considered for feature analysis.
 2. The method of claim 1, wherein realizing the collinearity of the events comprises identifying at least two events sharing an exact relationship and reducing the number of variables to facilitate effective feature selection.
 3. The method of claim 1, further comprising: selecting a new target event; and re-visualizing the influence of the selected features on the new target event.
 4. The method of claim 1, wherein an event is defined as a relationship which occurred at an instant of time.
 5. The method of claim 1, wherein the feature selection is defined as selecting a plurality of variables having an influence on the occurrence of the event.
 6. The method of claim 1, wherein calculating the temporal distance between at least two events comprises computing temporal correlation between the at least two events; wherein the temporal correlation is a measure of correlation of the events across time.
 7. The method of claim 1, wherein the temporal distance is calculated on the basis of time series data obtained from one of structured data, unstructured data, or a combination of structured data and unstructured data.
 8. The method of claim 1, wherein calculating the semantic distance between the two events comprises at least one of: measuring a contextual distance between one or more words if the plurality of events are described by the words; measuring a contextual similarity or a semantic similarity as provided by a domain model and a language model; and measuring the contextual or the semantic similarity based on analyzing the relationships and the entities.
 9. The method of claim 1, wherein the semantic distance is calculated on the basis of one of structured data, unstructured data, or a combination of structured data and unstructured data.
 10. The method of claim 1, further comprising storing the semantic distance and the temporal distance between the at least two events in a data structure capable of storing both the semantic distances and the temporal distances separately.
 11. The method of claim 1, wherein a value corresponding to the semantic distance and the temporal distance is in a preset numerical range or corresponds to a discrete range of values, with an event being closest to itself both temporally and semantically.
 12. The method of claim 1, wherein the representation of the visualization comprises: events represented by a predefined shape; negatively correlated events differentiated by the shape; different events represented by different colors; different event types represented by different colors; highly influential events rendered around the target event; and less influential events rendered away from the target event as they approach the origin.
 13. The method of claim 1, further comprising: setting limits for the semantic distances and the temporal distances for the plurality of events influencing the target event; and selecting the events falling in a defined limit as the highly influential events.
 14. A system for providing information visualization, the system comprising: an event extractor to extract a plurality of events from big data; a semantic distance estimator to calculate a semantic distance between at least two events; a temporal distance estimator to calculate a temporal distance between the at least two events; a data structure to store the calculated semantic distance and the temporal distance with respect to the at least two events; a user interface provided on a user device to: display one or more relevant events with respect to a target event based on the semantic distance and the temporal distance; and provide an interactive input to the data structure to select a plurality of events that influence the target event; and a display unit for visualizing the influence of the selected features on the target event.
 15. The system of claim 14, wherein the temporal distance estimator calculates the temporal distance between at least two events by computing a temporal correlation between the at least two events across time.
 16. The system of claim 14, wherein the semantic distance estimator calculates the semantic distance between the two events by at least one of: measuring a contextual distance between one or more words if the plurality of events are described by the words; measuring a contextual similarity or a semantic similarity as provided by a domain model and a language model; and measuring the contextual similarity or the semantic similarity based on analyzing the relationships and the entities.
 17. The system of claim 14, wherein the semantic distances and the temporal distances are stored in a data structure which allows separate storage and easy retrieval of each of the semantic distance and the temporal distance with respect to an event-target event pair. 
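
By way of a non-limiting illustration only, the distance calculation, storage and limit-based selection recited above may be sketched as follows. The sketch is not part of the claims and rests on several assumptions that the specification does not mandate: the temporal correlation is taken to be a Pearson correlation of two equal-length time series, the temporal distance is taken to be one minus the absolute correlation, the semantic distance is approximated by a Jaccard distance over the words describing each event (a simple stand-in for a domain model or language model), and the data structure is a plain dictionary keyed by event name. All names used here (pearson, temporal_distance, semantic_distance, EventDistance, build_distance_table, select_influential) are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(vx * vy)


def temporal_distance(xs, ys):
    """Assumed mapping: 1 - |correlation|, so 0 is closest and 1 is farthest."""
    return 1.0 - abs(pearson(xs, ys))


def semantic_distance(words_a, words_b):
    """Assumed stand-in: Jaccard distance between the word sets of two events."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


@dataclass(frozen=True)
class EventDistance:
    """Keeps the two distances as separate fields for one event-target pair."""
    semantic: float
    temporal: float
    negatively_correlated: bool


def build_distance_table(target, events):
    """events maps name -> (descriptive words, time series).

    Returns name -> EventDistance measured against the target event.
    """
    t_words, t_series = events[target]
    return {
        name: EventDistance(
            semantic=semantic_distance(t_words, words),
            temporal=temporal_distance(t_series, series),
            negatively_correlated=pearson(t_series, series) < 0,
        )
        for name, (words, series) in events.items()
    }


def select_influential(table, target, max_semantic, max_temporal):
    """Select events whose distances both fall within the set limits."""
    return sorted(
        name
        for name, d in table.items()
        if name != target
        and d.semantic <= max_semantic
        and d.temporal <= max_temporal
    )


# Made-up example data, for illustration only.
events = {
    "sales_drop": (["sales", "drop", "retail"], [1, 2, 3, 4, 5]),
    "price_rise": (["price", "rise", "retail", "sales"], [2, 4, 6, 8, 10]),
    "rainfall": (["rain", "weather"], [5, 1, 4, 2, 3]),
}
table = build_distance_table("sales_drop", events)
influential = select_influential(table, "sales_drop", 0.7, 0.2)
```

In this sketch the target event has a distance of zero to itself on both axes, consistent with an event being closest to itself, and the negatively_correlated flag is retained separately so that a renderer could differentiate such events by shape. Any other correlation measure, similarity model or storage layout could be substituted without changing the overall flow.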